Community Detection Using A Neighborhood 
Strength Driven Label Propagation Algorithm 



Jiemi Xie and Boleslaw K. Szymanski 
Department of Computer Science 
Rensselaer Polytechnic Institute 
110 8th Street 
Troy, New York 12180 
Email: {xiej2,szymansk}@cs.rpi.edu






Abstract — Studies of community structure and evolution in 
large social networks require a fast and accurate algorithm for 
community detection. As the size of analyzed communities grows, 
complexity of the community detection algorithm needs to be kept 
close to linear. The Label Propagation Algorithm (LPA) has the 
benefits of nearly-linear running time and easy implementation, 
thus it forms a good basis for efficient community detection 
methods. In this paper, we propose a new update rule and label
propagation criterion in LPA to improve both its computational 
efficiency and the quality of communities that it detects. The 
speed is optimized by avoiding unnecessary updates performed 
by the original algorithm. This change significantly reduces (by
an order of magnitude for large networks) the number of iterations
that the algorithm executes. We also evaluate our generalization 
of the LPA update rule that takes into account, with varying 
strength, connections to the neighborhood of a node considering 
a new label. Experiments on computer generated networks and 
a wide range of social networks show that our new rule improves 
the quality of the detected communities compared to those 
found by the original LPA. The benefit of considering positive 
neighborhood strength is pronounced especially on real-world 
networks containing a sufficiently large fraction of nodes with high
clustering coefficient. 

I. Introduction 

A network is said to have community structure if it can be 
divided into groups with dense connections within groups and 
sparse connections between groups. Community detection has 
received significant attention in physics and computer science 
communities. It has been applied to networks of many kinds, 
including the World Wide Web (WWW) [1], [2], collaboration
networks [3], communication networks, social networks [4],
biological networks [5], and so on. This problem has been
studied as the graph partitioning problem in computer science
for decades and is known to be NP-hard. Algorithms that find
communities of reasonably good quality have been proposed [6]
and improved extensively. To name a few, they include divisive
algorithms that recursively remove links from the network [7],
agglomerative algorithms that repeatedly merge smaller groups
of nodes [8], [9], and modularity maximization algorithms using
spectral clustering, simulated annealing or extremal optimization
[10], [11], [12], [13]. The quality of detected
communities can be measured using the modularity defined by
Newman [8]. A network whose modularity lies in the range
between 0.3 and 0.7 usually has a strong community structure.
Despite great efforts, the cost of detecting an unknown number
of communities of unknown size in an arbitrary complex
network remains high. For large networks, algorithms with
complexity of $O(n^2)$, where $n$ denotes the number of nodes
in the network, become prohibitively expensive in terms of
their execution time.
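As a concrete reference for the modularity measure mentioned above, the following is a minimal sketch of the standard definition for an undirected, unweighted graph, Q = sum over communities of (e_c/m - (d_c/2m)^2), where m is the number of edges, e_c the number of edges inside community c and d_c the total degree of its nodes. The function name and the edge-list input format are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges  : iterable of (u, v) pairs, each undirected edge listed once
    labels : dict mapping node -> community label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)  # sum of node degrees in the community
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TWO_GROUPS = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_GROUP = {v: "a" for v in range(6)}
```

With these inputs, splitting the graph into its two triangles gives Q = 5/14, while putting every node in a single community gives Q = 0.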

Recently, Raghavan [14] proposed a method called the label
propagation algorithm (LPA) to identify communities in large
networks, which runs in time linear in the number of edges, and
thus also linear in the number of nodes for sparse networks.
Initially, each node is assigned a unique label. During the 
iterative process, each node adopts the label in agreement with 
the majority of its neighbors. At the end of the algorithm, 
connected nodes with the same label form a community. This 
algorithm provides a number of desirable qualities such as 
no parameters, easy implementation and fast execution for 
practical networks. In this paper, we empirically study and 
analyze a generalized update rule for LPA. 
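The label propagation process described above can be sketched in a few lines. This is an illustrative asynchronous variant, not the reference implementation: the adjacency format (a dict of neighbor sets), the `max_sweeps` cap, and the choice to keep the current label whenever it is already among the most frequent ones are assumptions made for the sketch. Labels are assumed to be sortable so that tie-breaking is reproducible under a fixed seed.

```python
import random
from collections import Counter


def label_propagation(adj, seed=None, max_sweeps=100):
    """Asynchronous LPA sketch. adj maps node -> set of neighbor nodes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):           # max_sweeps is a safety cap (assumption)
        rng.shuffle(nodes)                # random update order in each sweep
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                  # isolated nodes keep their label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            if counts[labels[v]] == best:
                continue                  # current label is already a majority label
            candidates = sorted(l for l, c in counts.items() if c == best)
            labels[v] = rng.choice(candidates)  # ties are broken at random
            changed = True
        if not changed:                   # no node wants to change: converged
            break
    return labels


# Two disjoint triangles: each one must end up with a single label.
TRIANGLES = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
```

On `TRIANGLES`, every run ends with one label per triangle, although which of the three initial labels survives depends on the random choices.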

II. Related work 

In this section, we review community detection algorithms 
proposed in the literature. 

Girvan and Newman [15] propose a divisive hierarchical
clustering algorithm, referred to as GN, which consists of
four steps: 1) Calculate the betweenness for all edges in the
network. 2) Remove the edge with the highest betweenness. 3)
Recalculate betweenness for all edges affected by the removal.
4) Repeat from step 2 until no edges remain. The algorithm
utilizes non-local structural information, and thus works well
on real-world networks. However, a particular disadvantage of
GN is that it runs in $O(m^2 n)$ on a network of $n$ nodes and
$m$ edges, or $O(n^3)$ on a sparse network. Newman proposed
a faster algorithm, referred to as NM, in [8] with running
time $O((m + n)n)$, or $O(n^2)$ on a sparse network. NM is
an agglomerative hierarchical clustering algorithm that starts
with a state in which each node is a single community. Then,
it repeatedly merges pairs of communities together, choosing
at each step the merger that results in the greatest increase in
modularity (termed Q). In its faster version, called CNM [10],
the running time is reduced to $O(md \log n)$, where $d$ is the
depth of the dendrogram describing the network's community
structure. On a sparse network, it runs in $O(n \log^2 n)$. It
is known that NM has a resolution limit, failing to find
communities with sizes smaller than a certain value. A lot
of work has been done to improve GN and NM. For example,
some improvements attempt to strike a balance between the
community size and the gain in modularity with various
refinement strategies [16], [17]. These methods usually have
complexity comparable to the original NM algorithm.

Another fast greedy algorithm based on modularity optimization,
called the Louvain method, is proposed in [18]. The
method consists of two phases. First, it looks for "small"
communities by optimizing modularity locally. Second, it
aggregates nodes of the same community and builds a new network
whose nodes are the communities found in the previous
phase. These steps are repeated iteratively until a maximum
of modularity is attained. In extensive experiments, the
complexity of this method scales as $O(n \log n)$, even though
this has not been formally proved. Other algorithms seeking
the maximization of modularity use various techniques such as
spectral clustering, simulated annealing or extremal optimization
[10], [11], [12], [13]. These methods usually
obtain higher values of modularity than the original NM.

Random walks have been applied successfully to finding
communities [19], [20], [21]. The idea behind this approach is that
the walk tends to be trapped in dense parts of a network corre- 
sponding to communities. For these algorithms, the complexity 
of computing distance or proximity between all pairs of nodes 
exactly is $O(n^3)$. Some approximation techniques are usually
used. WalkTrap (WT), proposed in [21], is built on a measure
of similarity between nodes based on random walks. WT
has a time complexity of $O(mn^2)$ but runs in $O(n^2 \log n)$
in most real-world cases. The Markov Cluster Algorithm (MCL),
proposed in [22], is an unsupervised clustering algorithm based
on simulations of flow. In some sense, it is a random walk
with decay. By keeping only a maximum number $k$ of nonzero
elements in each column when computing the matrix
multiplication, the complexity is reduced to $O(nk^2)$ on a sparse
network. However, this algorithm is sensitive to the parameter 
called inflation. 

Spectral clustering [23], [24], [25] first embeds a network
in space and then uses a fast clustering algorithm to find
communities. The space is spanned by eigenvectors. The
spectral optimization method proposed by Newman [11] runs
in O(n^) on a sparse network. White and Smyth (WS) propose 
a fast spectral clustering algorithm in [24]. They reformulate
the problem of modularity optimization as a discrete quadratic
assignment problem. Then they relax it to a continuous one
which can be solved by eigen-decomposition. Their algorithm
uses the Implicitly Restarted Lanczos Method (IRLM) and
k-means, and has complexity $O(mKh + nK^2h + K^3h + nK^2t)$,
where the first three terms represent the complexity of IRLM
while the last one represents the complexity of executing
k-means. Here $m$ and $n$ denote the numbers of edges and nodes,
$K$ stands for the maximum number of eigenvectors, $h$ is the
number of iterations required for IRLM to converge, and $t$
denotes the number of iterations of the k-means algorithm. On
a sparse network, the algorithm scales roughly linearly as a
function of $n$.



Multi-state spin models [26], [27], [28], [29], [30] (e.g., the
q-state Potts model), in which a spin is assigned to each node in
a network, can also be applied to community detection. In such
a setting, community detection is equivalent to minimizing the
Hamiltonian of the model. The corresponding algorithms are
related to the label propagation algorithms discussed below
and are usually fast. However, they may require some prior
knowledge of the network structure (for example, knowing a
pair of nodes each of which belongs to a different community)
before they can be applied to community detection (e.g., the
Ferromagnetic Random Field Ising Model [28]).

The idea of propagating labels through a network has been
studied by Bagrow [31] in his L-shell method. Starting from
a node with a label, the algorithm propagates the label step
by step and includes more neighbor nodes until the end of
a community is reached. The boundary of a community is
identified by a threshold defined as the ratio of the number
of edges inside and outside of the community. A similar idea is
studied by Costa in [32]. Wu [33] proposed a method which
partitions a network into two communities. The network is
viewed as an electric circuit, and a battery is attached to two
random nodes that are supposed to be within two different
communities. The algorithm amounts to solving Kirchhoff equations,
with the voltages of these two nodes fixed to be 0 and 1. In other
words, each node updates its value (i.e., voltage) by taking the
average of all its neighbors' values. When the process converges,
the voltage gap indicates the border, and two communities are
identified. Although this method can be generalized to detecting
multiple communities, it requires the number of communities as
input, and tends to find communities of approximately the
same size.
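The voltage-averaging idea described above can be illustrated with a small sketch. It is a simplified reading of the description in this section only: the fixed number of sweeps, the in-place (Gauss-Seidel style) averaging and the rule of cutting at the single largest voltage gap are assumptions made for the example, and the function name is hypothetical.

```python
def voltage_partition(adj, source, sink, sweeps=200):
    """Split a graph in two by relaxing node voltages between two poles.

    adj maps node -> set of neighbors; source is held at 1 and sink at 0.
    """
    volt = {v: 0.0 for v in adj}
    volt[source] = 1.0
    for _ in range(sweeps):               # fixed sweep count (assumption)
        for v in adj:
            if v == source or v == sink or not adj[v]:
                continue
            # each free node takes the average voltage of its neighbors
            volt[v] = sum(volt[u] for u in adj[v]) / len(adj[v])
    ordered = sorted(adj, key=volt.get, reverse=True)
    gaps = [volt[ordered[i]] - volt[ordered[i + 1]]
            for i in range(len(ordered) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__) + 1  # largest voltage gap
    return set(ordered[:cut]), set(ordered[cut:]), volt


# Two triangles joined by the edge (2, 3); poles placed at nodes 0 and 5.
BRIDGED = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

For `BRIDGED`, the voltages settle near 1, 6/7, 5/7, 2/7, 1/7 and 0 for nodes 0 to 5, so the largest gap falls on the bridge between nodes 2 and 3.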

The LPA [14] uses the network structure alone to guide its
process and requires neither parameters nor optimization
of an objective function. It starts from a configuration where
each node has a distinct label. At every step, one node (in the
asynchronous version) or each node (in the synchronous version)
makes its own decision to change its label to the one carried
by the largest number of its neighbors. By construction, as the
algorithm converges, each node has more neighbors in its own
community than in any other community. One drawback of
LPA is that it returns different solutions (some of them of poor
quality) in different realizations. This is because the quality
of an LPA solution depends on the local minimum it reaches.
Tibely and Kertesz [34] show that this model is equivalent
to finding the local minima of a simple Potts model [30]. The
number of such local minima was found to be much larger
than the number of nodes in the underlying network. Barber
[35] defines an equivalent objective function, based on the
number of edges that connect vertices with identical labels, that
penalizes low quality solutions. Leung [36] extends LPA by
incorporating heuristics like the hop attenuation score to improve
the quality of the detected communities. Gregory [37] applies
a similar idea to the detection of overlapping communities. Each
vertex is allowed to belong to up to v communities, where v
is the parameter of the algorithm.

In this paper, we enhance the LPA by introducing new
update and label propagation rules that achieve higher speed
of execution and improve the quality of the communities
detected. Although the updating process in LPA can be either
synchronous or asynchronous, we restrict our attention to
the asynchronous version here.

TABLE I

The number of iterations (scaled by n) required for
convergence on social networks

III. Improving the Speed of LPA 

Although LPA runs in linear time, there are ways to improve
the execution time in practice, and such improvements
are essential when we process extremely large networks or
move from offline to online detection. The basic idea of our
improvement is to avoid unnecessary updates in each iteration
of the algorithm, while keeping the overall behavior of the
algorithm unchanged. As observed, at the early stage of the
original LPA, most nodes are in a very diverse neighborhood.
The effectiveness of updates (i.e., the fraction of attempted
updates that result in changes to new labels) is high. However,
after a few iterations the competition between communities is
restricted to their boundaries. For nodes inside a
community, the updates are unnecessary, since their labels
essentially do not change. As shown in [14], after five iterations,
95% of nodes are already correctly clustered. Additional time is
spent attempting updates that are expected to fail to
change labels, so the final convergence of the algorithm is
delayed. It turns out that this time can easily be
saved by bookkeeping the information about the boundaries
of the currently existing communities. There was an initial
attempt along this line of LPA speed improvement [36].
However, unlike that attempt, our improvement requires neither
a threshold value nor a modification of the stop criterion.
Moreover, our newly introduced update rule causes attempted
updates to be highly effective.

We refer to a node all of whose neighbors have the same label
as it does as an interior node. Nodes that are not interior
are called boundary nodes. A node that would not change its
label if it were to attempt an update is referred to as passive,
while a node that is not passive is called active. Clearly, all
interior nodes are passive by definition. On the other hand, a
boundary node could be either active or passive, depending on
its neighborhood. Hence, each node may be in one of three
states: passive interior, passive boundary or active boundary. In
general, the update rule itself defines a natural end-of-execution
condition, which allows the algorithm to finish when
every node becomes passive, the situation to which we refer
as convergence of a network. Moreover, we maintain a list,
called the active node list, that contains all currently active nodes.
The general outline of the LPA improvement is as follows:

1) At time t=0, construct the active node list containing 
all the nodes. 

2) Randomly pick an active node, say i, from the
list and attempt to adopt a new label according to
the update rule. Since only active nodes are placed
initially on the list and they remain on the list as long
as they are active, each node selected for an update
will change its label during the update.

3) First, check if the updated node became passive and 
if so, remove it from the list. Next, check all its 
neighbors for the change of status in the following 
three steps. (1) If an interior neighbor became an 
active boundary node, add it to the active node 
list. (2) Remove any previously active neighbor that 
became passive from the active node list. (3) Add any 
previously passive boundary neighbor that became 
active to the active node list. 

4) If the active node list is empty, stop; otherwise, 
increase time t by one unit and go to step 2. 
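The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the majority rule with c = 0 is used, a node counts as passive when its own label is already among the most frequent labels of its neighbors, and the list/position-dictionary pair used for constant-time random selection and removal is one possible choice of data structure. For brevity, the status of each neighbor is recomputed from scratch after an update, which costs more than the $O(d_i)$ bookkeeping an optimized version would use.

```python
import random
from collections import Counter


def top_labels(adj, labels, v):
    """Labels v may adopt; [labels[v]] means v would not change (passive)."""
    counts = Counter(labels[u] for u in adj[v])
    if not counts or counts[labels[v]] == max(counts.values()):
        return [labels[v]]
    best = max(counts.values())
    return sorted(l for l, c in counts.items() if c == best)


def is_active(adj, labels, v):
    return top_labels(adj, labels, v) != [labels[v]]


def speedup_lpa(adj, seed=None):
    """LPA driven by an active node list; returns (labels, effective updates)."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    active = [v for v in adj if is_active(adj, labels, v)]    # step 1
    pos = {v: i for i, v in enumerate(active)}                # node -> list index

    def add(v):
        if v not in pos:
            pos[v] = len(active)
            active.append(v)

    def remove(v):
        i = pos.pop(v, None)
        if i is None:
            return
        last = active.pop()
        if last != v:             # fill the hole with the former last element
            active[i] = last
            pos[last] = i

    updates = 0
    while active:                                             # step 4
        v = active[rng.randrange(len(active))]                # step 2
        labels[v] = rng.choice(top_labels(adj, labels, v))
        updates += 1
        for u in [v, *adj[v]]:                                # step 3
            if is_active(adj, labels, u):
                add(u)
            else:
                remove(u)
    return labels, updates


TRIANGLES = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
```

On the two disjoint triangles in `TRIANGLES`, every run performs exactly four effective updates (two per triangle), and each triangle ends with a single label.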

The complexity of the improved algorithm is unchanged.
Initialization of the active node list requires $O(n)$ time.
Randomly selecting one node takes $O(1)$ time. Updating the
node $i$ and its neighbors requires $O(d_i)$, where $d_i$ is the
degree of node $i$. By using the active node list, evaluating the
convergence of the whole network is easy and takes exactly
$O(1)$ by checking if the list is empty. In our improvement, the
number of iterations needed for the algorithm to converge is
equal to the total number of effective updates.

To evaluate the efficiency of our improvement, we have run
the original LPA and the LPA with our improvement on a
wide range of social networks. The two tested algorithms are
denoted as org-LPA and speed-up-LPA in Table I, respectively.
Social networks used throughout the paper include:
1) karate: Zachary's karate club network [38]; 2) football:
the schedule of games of US college football teams [7]; 3)
lesmis: interactions between major characters in Victor Hugo's
novel Les Misérables [39]; 4) polbooks: books on American
politics co-purchased on Amazon.com [40]; 5) netscience: a
network of authors publishing articles on network science
[41]; 6) email: a social network in a company implied by
the interactions via email [42]; 7) eva: a network of a US
company; 8) PGP: a network of users of the Pretty-Good-Privacy
algorithm for secure information interchange [43];
9) CA-GrQc: a network of researchers in General Relativity
publishing in Arxiv on that topic [44]. Note that in all these
networks, only the largest connected component is used.

We measured the speed in terms of the number of iterations
scaled by the network size n. We repeated each experiment
100 times and reported the average. As shown in Table I, the
scaled number of iterations of the new framework does not
depend on the network size, and for networks with up to about
ten thousand nodes it remains below 3. The speed of the algorithm
improves by a factor of at least 1.5 for all tested networks,
but on larger networks, like email, eva, CA-GrQc and PGP,
the algorithm is 6 or more times faster than the original LPA.

IV. A Generalized Update Rule 

A. Neighborhood Strength Driven LPA 

The propagation of a label is analogous to the spreading of an
epidemic, idea, opinion or information in a network. By assuming
that a node always adopts the label of the majority of its
neighbors, the LPA ignores any structure existing in the
neighborhood of this node. This makes the algorithm very simple.
However, in reality, a person adopting a new idea often follows
a neighbor who has more connections to other neighbors,
because this neighbor has a higher number of potential sources
of information. For the same reason, a node joining a
group (i.e., changing its label to the one shared by this group)
may take into account not only how many members are in this
particular group (as the original LPA does) but also how well
they are connected to other neighbors of the node executing
the label update rule. Following this idea, we generalize the
update rule of LPA as follows:

\[ L(i) = L\left( \arg\max_{C_k} S(C_k) \right) \]

where $L$ denotes the label of a node or a community. $C_k$ is the
sub-community containing the set of nodes connected to node
$i$ and sharing among themselves the same label $k$. $S$ is the
score function of a sub-community, defined as

\[ S(C_k) = \sum_{j \in C_k} \{ 1 + c \cdot h_j(i) \} \]

The first term accounts for the direct link from node $j$ to
node $i$. The second term represents the new generalized rule
of LPA. Here $h_j(i)$ is the number of links from node $j$ to the
entire neighborhood of node $i$, excluding node $i$. $c$ is a weight
between 0 and 1, indicating how much we want to weight the
impact of node $j$ on other neighbors. When $c = 1$, we place
the same weight on all the links in the neighborhood; when
$c = 0$, we discount those links, except the direct connection
to node $i$, which reduces the new rule to the original LPA
rule. For simplicity, we do not account for links from $j$ to its
own community or to other communities. Note that we still
restrict the process to a local neighborhood of a node, and do
not consider links that go outside of this local neighborhood.
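To make the score function concrete, here is a small sketch that evaluates S(C_k) for every label present among the neighbors of a node. The function name, the dict-of-sets adjacency format and the toy graph are illustrative assumptions; the graph is assumed to have no self-loops, so node i never appears in its own neighbor set.

```python
from collections import defaultdict


def neighborhood_scores(adj, labels, i, c):
    """Return {label k: S(C_k)} over the sub-communities around node i.

    adj maps node -> set of neighbors (no self-loops assumed);
    c is the neighborhood strength weight, between 0 and 1.
    """
    nbrs = adj[i]
    scores = defaultdict(float)
    for j in nbrs:
        # h_j(i): links from j to the other neighbors of i (i itself excluded)
        h = sum(1 for u in adj[j] if u in nbrs)
        scores[labels[j]] += 1.0 + c * h
    return dict(scores)


# Node 0 has two mutually linked neighbors labeled "a" and three
# unconnected neighbors labeled "b".
STAR = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}, 5: {0}}
STAR_LABELS = {0: "z", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
```

With c = 0 the scores are a: 2 and b: 3, so node 0 follows the plain majority and adopts "b". With c = 1 the link between nodes 1 and 2 raises the score of "a" to 4, and node 0 adopts "a" instead.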

Each value of c provides some guidance for the label
propagation. Ties between candidate labels in this process are
broken randomly and contribute to the random output
of the algorithm. Assigning the weight c is non-trivial, since
LPA has a counterintuitive nature, in which the communities
are formed around some local minima instead of the global
optimum. Hence, a value of c that provides a good
balance between converging quickly and not getting trapped
in undesired local minima leads to better results. From the
experiments below, we find that there is a connection between
the factor c and the node Clustering Coefficients (CC) [45].



V. Evaluation of Performance 

To test the performance of the generalized update rule,
we incorporate it into the modified LPA framework of
Section III to create a new community detection algorithm
and apply it to both computer generated networks and
real-world social networks. In the experiments, we explore
different values of c. More specifically, we analyze the quality
of community detection for values of c taken from the set
{0, 0.05, 0.25, 0.65, 0.8, 1}. When c=1, all links are equally
important, while for c=0 (i.e., for the original LPA), the links to
other neighbors, except those to the node under consideration,
are completely ignored. Values of c above 0 and below 1 define
how much we favor one of these two extremes.

During the asynchronous updating scheme, in each step,
there is a choice of the node to which the update rule is to be
applied next. In addition, when there is a tie among labels 
with the highest scores, there is also a choice of the final 
label assigned to the node executing the update. Communities 
detected depend on what selections are made for these choices 
because each selection may trap the solution in a local min- 
imum. Usually, these selections are done randomly, in which 
case every run may produce different outcomes. Therefore, 
statistical measures of quality of communities detected, such 
as the average and the best, are all important metrics of the 
algorithm performance. 

A. Tests on Computer Generated Networks 

1) Benchmark networks and quality measures: The reason
for using computer generated networks is that these networks
have well-defined community structures, i.e., we know the
pre-assigned true label of each node. We adopted the LFR
benchmark [46], which is a special case of the planted
l-partition model [47]. LFR networks are similar to real-world
networks because they are characterized, like most
real-world networks, by heterogeneous distributions of node
degrees and community sizes. In our experiments, we used
the following fixed parameters: node degrees and community
sizes are governed by power laws, with exponents 2 and
1 [47]; the maximum degree is 50; the community sizes vary
between 10 and 50; the network size is set at $N = 1000$ and
the average degree is kept at $\langle k \rangle = 5$. We varied the
mixing parameter $\mu$, which is the expected fraction of links of a
node connecting to other communities. In other words, each node
has $(1 - \mu)\langle k \rangle$ intra-community links on average. The
larger the value of $\mu$, the weaker the community structure.
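As a worked illustration of the mixing parameter, with an average degree of 5 and a mixing parameter of 0.3 each node has on average (1 - 0.3) * 5 = 3.5 links inside its own community. The sketch below measures the empirical counterpart on a labeled graph, as the average over nodes of the fraction of links that leave the node's community. The function name and input format are assumptions made for the example; this is not part of the LFR generator.

```python
def mixing_parameter(adj, labels):
    """Average fraction of each node's links that go to other communities.

    adj maps node -> set of neighbors; labels maps node -> community.
    Isolated nodes are skipped because the fraction is undefined for them.
    """
    fractions = []
    for v, nbrs in adj.items():
        if nbrs:
            external = sum(1 for u in nbrs if labels[u] != labels[v])
            fractions.append(external / len(nbrs))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Two triangles joined by the edge (2, 3), each triangle one community.
BRIDGED = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
BRIDGED_LABELS = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
```

Here only nodes 2 and 3 have an external link (one of three each), so the measured value is (1/3 + 1/3) / 6 = 1/9.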

Many measures have been proposed for quantifying the
quality of a partition produced by a detection algorithm with
respect to the known true partition. Each of them has its
advantages and disadvantages. In this paper, we have carefully
chosen two of them. In [6], Danon proposed to use the Normalized
Mutual Information (NMI) for network partitions, measuring
the amount of information correctly extracted by the detection
algorithms. NMI is shown to be reliable and is often used in the
physics literature. The Rand index is a measure of the similarity
between two partitions, indicating how much they agree in
terms of pairs of nodes. We use its adjusted-for-chance form,
namely the Adjusted Rand Index (ARI) [48]. Both NMI and
ARI have value 1 for a perfect match and 0 for a random or
independent partition [49].

Fig. 1. Normalized Mutual Information of LPA with various c's on
LFR networks with $N = 1000$ and $\langle k \rangle = 5$.

Fig. 3. Normalized Mutual Information of different detection
algorithms on LFR networks with $N = 1000$ and $\langle k \rangle = 5$.
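The two quality measures can be computed directly from the contingency table of two labelings. The sketch below uses the common definitions: NMI normalized as 2I(A;B) / (H(A) + H(B)) and the ARI in its usual pair-counting form. It assumes both labelings are given as equal-length sequences over at least two nodes; the function names are illustrative and the normalization actually used in the experiments is not restated in this section.

```python
from collections import Counter
from math import comb, log


def _contingency(a, b):
    """Joint counts n_ij and the marginal counts of the two labelings."""
    return Counter(zip(a, b)), Counter(a), Counter(b)


def nmi(a, b):
    """Normalized mutual information, 2 I(A;B) / (H(A) + H(B))."""
    n = len(a)
    joint, ca, cb = _contingency(a, b)
    h_a = -sum(x / n * log(x / n) for x in ca.values())
    h_b = -sum(y / n * log(y / n) for y in cb.values())
    if h_a + h_b == 0:
        return 1.0                        # both partitions are a single group
    mi = sum(nij / n * log(nij * n / (ca[i] * cb[j]))
             for (i, j), nij in joint.items())
    return 2 * mi / (h_a + h_b)


def ari(a, b):
    """Adjusted Rand Index from pair counts (requires len(a) >= 2)."""
    n = len(a)
    joint, ca, cb = _contingency(a, b)
    sum_ij = sum(comb(x, 2) for x in joint.values())
    sum_a = sum(comb(x, 2) for x in ca.values())
    sum_b = sum(comb(y, 2) for y in cb.values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                        # degenerate case, e.g. one group each
    return (sum_ij - expected) / (max_index - expected)


TRUE = [0, 0, 0, 1, 1, 1]
FOUND = ["x", "x", "x", "y", "y", "y"]    # same grouping, different label names
```

Both measures depend only on the grouping, not on the label names, so `TRUE` and `FOUND` score 1 on each. For the two crossing partitions [0, 0, 1, 1] and [0, 1, 0, 1], NMI is 0 and ARI is -0.5, which shows that ARI is zero only in expectation for random partitions and can be negative for a particular pair.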

For comparison, two algorithms, extremal optimization
(ExtOpt for short) and MCL, are included in the
experiments as references. ExtOpt is a modularity maximization
algorithm, which usually obtains high modularity. MCL
performs well and is fast in practice. We used 1.4 for the
inflation parameter, which achieved good results. For our
generalized LPA algorithms, we repeated each run 10 times
and kept the maximum scores. For each $\mu$, the average value
over 10 realizations of networks is reported.

ExtOpt: http://deim.urv.cat/~aarenas/data
MCL: http://micans.org/mcl/

2) Performance analysis: Fig. 1 and Fig. 2 demonstrate
that the generalized LPA maintains similar behavior when the
neighborhood strength is varied (we omitted c = 0.25 because
it is close to c = 0.05). When $\mu$ is relatively small, the strength
(c > 0) does not play a significant role, resulting in similar
performance for different values of c. However, when $\mu$ exceeds
0.35, LPA with c > 0 performs better. Although the average degree
is fixed by construction (so is the degree distribution), we
observed that the distribution of node clustering coefficients
(CC) changes significantly (from CC = 0.54 for $\mu = 0.1$
to CC = 0.25 for $\mu = 0.3$). Since each community is
connected in a manner similar to a random graph, increasing
$\mu$ leads to smaller average clustering coefficients within the same
community. During the updating, the majority rule of LPA
becomes weaker. Therefore, adopting the label from a group of
neighboring nodes with the same label and more intra-connections
reduces the effect of the underlying structural change. This
results in a stable performance of LPA with c > 0 for
$0.3 < \mu < 0.5$. Another explanation of the benefit of having
c > 0 is that it tends to maintain the tension between
communities during the evolution. Such tension would trap
LPA in a sub-optimal solution. However, with c = 0, LPA
loses the tension more quickly. This is especially pronounced
for $\mu > 0.5$, in which case the algorithm yields a single
community most of the time, ending with very low average
performance in terms of quality of communities detected.

Fig. 2. Adjusted Rand Index of LPA with various c's on LFR networks
with $N = 1000$ and $\langle k \rangle = 5$.

Fig. 4. Adjusted Rand Index of different detection algorithms on LFR
networks with $N = 1000$ and $\langle k \rangle = 5$.

As shown in Fig. 3 and Fig. 4, among MCL, ExtOpt
and LPA, the LPA outperforms the others consistently over
a wide range of $\mu < 0.5$, for which the community structure
is termed strong [50], because each node has more links to
its own community than to the nodes outside it. In fact, the
performance of all algorithms drops sharply before $\mu = 0.5$
because the tested networks are sparse, which makes detection
harder. LPA with c = 0 performs worst beyond $\mu = 0.5$. For
LPA with c > 0, the neighbor strength advantage is carried up
to $\mu = 0.65$, beyond which all versions of LPA find almost
only the trivial solution, i.e., a single community. Comparing
the NMI and ARI measures, it is interesting to observe that ARI
is more sensitive to the performance of an algorithm than
NMI. For example, a sharper change is observed in the ARI plots,
especially for LPA with c = 0.

B. Tests on Real-world Social Networks 

As in Section III, we repeated each experiment 100 times.
The quality of the detected communities is measured by the
modularity Q [8]. In Table II, we separated c=0 (i.e., LPA)
and c>0 for comparison.

1) Maximum performance: In Table II, we report the maximum
modularity obtained in tests. LPA-Q (c=0) denotes the
highest performance for the original LPA and LPA-Q (c>0)
denotes the highest performance of the algorithm with the new
rule (the positive value of c with which that performance was
achieved is shown in parentheses). We also list the modularity
either reported in the literature or obtained by other algorithms as
a reference. On the karate network, all algorithms with different
weights achieve the same maximum modularity (0.416) due
to the small size of the network and its limited structure
variation. On the lesmis network, c=0.05, only slightly above
0, yields higher modularity than LPA. The football network is a
special case, on which the highest modularity (though by a very
small margin) is achieved with c=0. For the other networks, higher
modularity is obtained with c=1.0 (or c=0.25) than with c=0,
which shows that the neighbor connections (c>0) provide
useful information for guiding the evolution of the algorithm.

2) Average performance and stability: By repeating many
runs, one can measure the average performance to evaluate
the variability of the quality of communities detected in those
runs. Table III shows the results for networks with at least
100 nodes. In our experiments, we observed that some LPA
runs reached complete consensus (i.e., the result is a single
community, in which Q=0). For algorithms with c>0, this
rarely happens. We remove such cases when we compute the
average modularity for LPA. As shown, the algorithm with c>0
obtains higher average performance than LPA with c=0 on
most networks. A trend that favors c>0 is clearly shown,
except for the football network. In other words, as c increases (more
weight for neighbor connections), the algorithm becomes more
stable on most of these networks. One observation from larger
networks (n>1000) is that a c that achieves a higher maximum
modularity also yields a better average performance.

TABLE II

Evaluation of clustering quality by maximum modularity

TABLE III

The average modularity of LPA with various c's

Network      c=1.0   c=0.8   c=0.65  c=0.25  c=0.05  LPA (c=0)
football     0.567   0.568   0.568   0.576   0.577   0.590
polbooks     0.521   0.521   0.519   0.509   0.507   0.487
netscience   0.804   0.804   0.804   0.803   0.802   0.798
email        0.490   0.485   0.471   0.408   0.298   0.230
eva          0.919   0.916   0.911   0.891   0.890   0.890
CA-GrQc      0.756   0.753   0.753   0.750   0.748   0.752
PGP          0.830   0.824   0.822   0.807   0.802   0.802

3) The weight factor c and the clustering coefficient
distribution: Both Table II and Table III show that for many
networks, c>0 (often c=1) yields better performance than c=0.
Our conjecture is that c is strongly related to both the degree
distribution and the clustering coefficient distribution. Given that
all tested networks are real-world networks, their degree
distributions are similar. Hence, we discuss here the clustering
coefficient distribution. In Fig. 5, we show the cumulative
probability distribution of the clustering coefficient (abbreviated as
cc), i.e., $P(CC < cc)$, where $cc = z / (d_i(d_i - 1)/2)$ for a node
with $z$ links in the neighborhood (for $d_i = 1$, $cc = 1$). All cc's
are grouped in bins of width 0.1. In the case where the
algorithm favors smaller c, e.g., the football network, we observe
the distribution shown in Fig. 5 (blue). If we consider nodes
with cc>0.9 as highly clustered, then in the football network
most nodes are not strongly clustered (with cc<0.6). The other
distribution in Fig. 5 (red) presents a different feature, that
is, nodes with smaller cc are roughly uniformly distributed,
and there is a large fraction of nodes with high cc. For
example, in the netscience network (a co-authorship network),
there are many small communities with a key node
(a core researcher) connected either to many isolated nodes
or to a strongly inter-connected group (e.g., a research lab), both
of which account for large values of cc for the corresponding
nodes. Since the new rule with positive c tends to prefer a
sub-community with more connections inside the neighborhood
(in some sense it implies that such a sub-community also has
stronger interconnections), it is consistent with the feature
observed. Therefore, it is not surprising that the new rule
works more efficiently than LPA (with c=0) on networks like
netscience, email, eva, CA-GrQc and PGP, which all have a similar
shape of clustering coefficient distribution.

Fig. 5. The clustering coefficient distribution for the football network with
average cc=0.4032 and for the netscience network with average cc=0.8125.
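The clustering coefficient defined above, and its binned cumulative distribution, can be computed as in the following sketch. The convention cc = 1 for nodes of degree 1 follows the text; returning 0 for isolated nodes, using "less than or equal" at the upper edge of each bin, and the function names are assumptions made for the example.

```python
def clustering_coefficient(adj, v):
    """cc = z / (d(d-1)/2), with z the number of links among v's neighbors."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 1.0 if d == 1 else 0.0     # d = 1 gives cc = 1, as in the text
    # every neighbor-neighbor link is seen from both of its ends, hence // 2
    z = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
    return z / (d * (d - 1) / 2)


def cumulative_cc(adj, width=0.1):
    """List of (upper bin edge, fraction of nodes with cc <= edge)."""
    values = [clustering_coefficient(adj, v) for v in adj]
    n_bins = round(1 / width)
    edges = [(b + 1) * width for b in range(n_bins)]
    # the small tolerance guards against floating-point error at bin edges
    return [(e, sum(1 for x in values if x <= e + 1e-12) / len(values))
            for e in edges]


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
PENDANT = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

In `PENDANT`, node 2 has three neighbors with one link among them, so cc = 1/3, while nodes 0, 1 and 3 all have cc = 1. The cumulative fraction is therefore 0.25 from the bin ending at 0.4 up to the bin ending at 0.9, and 1 in the last bin.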

4) Community size: As in the case of the LPA algorithm,
the community sizes found by the algorithm with c>0 follow
a power law distribution, $P(S > s) \propto s^{\alpha}$. The exponent
$\alpha$ estimated from the clustering result with the maximum
modularity for the PGP network (see Fig. 7, with c=1.0) is about
-1.28. For the two-part power law for the email network (see
Fig. 6, with c=0.25), the two values of $\alpha$ are -1.65 and -0.45.
This is consistent with previous observations [14] discussed in [10].
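For readers who want to reproduce this kind of analysis, the sketch below builds the empirical distribution P(S > s) from a labeling and estimates an exponent from the slope of a least-squares line in log-log space. The fitting procedure is an assumption made for illustration, since the estimation method is not stated here, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def size_ccdf(labels):
    """Points (s, P(S > s)) for the community sizes in a node -> label dict.

    The largest size is omitted because P(S > s) = 0 has no logarithm.
    """
    sizes = list(Counter(labels.values()).values())
    n = len(sizes)
    points = []
    for s in sorted(set(sizes)):
        greater = sum(1 for x in sizes if x > s)
        if greater:
            points.append((s, greater / n))
    return points


def loglog_slope(points):
    """Least-squares slope of log P against log s (needs >= 2 distinct s)."""
    xs = [log(s) for s, _ in points]
    ys = [log(p) for _, p in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Points lying exactly on P(S > s) = 0.5 / s, i.e. a slope of -1.
EXACT = [(1, 0.5), (2, 0.25), (4, 0.125)]
```

For `EXACT` the fitted slope is -1. A two-part power law, such as the one reported for the email network, would require fitting two ranges of s separately.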

VI. Conclusions

In this paper, we presented a new community detection 
algorithm that improves both the speed and quality of detected 
communities when compared to the original LPA algorithm. 
The generalized update rule allows us to incorporate useful 
neighborhood information. Both maximum and average de- 
tected community quality improves for most of the tested 
networks. The parameter c is related to an interesting feature 
of the networks, i.e., the clustering coefficient distribution, 
that explains the difference in optimal value of c for many 
different networks. However, the selection of c is not yet fully
understood, and therefore it is a subject worthy of further
study. Extending our approach to overlapping and online
community detection is another future research direction that 
we plan to pursue. 